Multimodal recognition of visual concepts using histograms of textual concepts and selective weighted late fusion scheme

Authors

  • Ningning Liu
  • Emmanuel Dellandréa
  • Liming Chen
  • Chao Zhu
  • Yu Zhang
  • Charles-Edmond Bichot
  • Stéphane Bres
  • Bruno Tellez
Abstract

The text associated with images provides valuable semantic information about image content that can hardly be described by low-level visual features. In this paper, we propose a novel multimodal approach to automatically predict the visual concepts of images through an effective fusion of textual features along with visual ones. In contrast to the classical Bag-of-Words approach, which simply relies on term frequencies, we propose a novel textual descriptor, namely the Histogram of Textual Concepts (HTC), which accounts for the relatedness of semantic concepts in accumulating the contributions of words from the image caption toward a dictionary. In addition to the popular SIFT-like features, we also evaluate a set of mid-level visual features, aiming at characterizing the harmony, dynamism and aesthetic quality of visual content, in relationship with affective concepts. Finally, a novel selective weighted late fusion (SWLF) scheme is proposed to automatically select and weight the scores from the best features according to the concept to be classified. This scheme proves particularly useful for the image annotation task with a multi-label scenario. Extensive experiments were carried out on the MIR FLICKR image collection within the ImageCLEF 2011 photo annotation challenge. Our best model, which is a late fusion of textual and visual features, achieved a MiAP (Mean interpolated Average Precision) of 43.69% and ranked 2nd out of 79 runs. We also provide a comprehensive analysis of the experimental results and give some insights for future improvements.

Preprint submitted to Computer Vision and Image Understanding, November 24, 2012.
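The two ideas named in the abstract, HTC and SWLF, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The relatedness function, the normalization of the histogram, the top-k selection rule, the use of validation quality as fusion weight, and the feature names are all assumptions made for illustration; the abstract does not specify them.

```python
from typing import Callable, Dict, List, Sequence


def histogram_of_textual_concepts(
    words: Sequence[str],
    dictionary: Sequence[str],
    relatedness: Callable[[str, str], float],
) -> List[float]:
    """Accumulate each caption word's relatedness to every dictionary concept.

    Unlike a term-frequency histogram, a word contributes to a bin even when
    it is not the dictionary entry itself, as long as it is related to it.
    The final normalization to unit sum is an assumption of this sketch.
    """
    hist = [sum(relatedness(w, c) for w in words) for c in dictionary]
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist


def selective_weighted_late_fusion(
    scores: Dict[str, float],
    validation_quality: Dict[str, float],
    top_k: int,
) -> float:
    """Fuse per-feature classifier scores for ONE concept and ONE image.

    Assumed rule: keep the top_k features ranked by their validation quality
    for this concept, then weight their scores by normalized quality.
    """
    ranked = sorted(validation_quality, key=validation_quality.get, reverse=True)
    selected = ranked[:top_k]
    total = sum(validation_quality[f] for f in selected)
    if total == 0:
        return 0.0
    return sum(validation_quality[f] / total * scores[f] for f in selected)


# Toy relatedness table (hypothetical values, standing in for a real
# semantic similarity measure between a caption word and a concept).
_TOY_RELATEDNESS = {
    ("dog", "animal"): 0.75,
    ("beach", "sea"): 0.25,
}


def toy_relatedness(word: str, concept: str) -> float:
    return _TOY_RELATEDNESS.get((word, concept), 0.0)


htc = histogram_of_textual_concepts(["dog", "beach"], ["animal", "sea"], toy_relatedness)

# Hypothetical feature names and numbers for a single concept.
fused = selective_weighted_late_fusion(
    scores={"sift": 0.8, "htc": 0.4, "color": 0.0},
    validation_quality={"sift": 0.5, "htc": 0.5, "color": 0.1},
    top_k=2,
)
```

In this toy run neither caption word appears in the dictionary, yet both bins receive mass, which is the property that distinguishes HTC from term counting. The fusion step drops the weakest feature for the concept before averaging, so the selection is made per concept rather than once for all concepts.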

Similar resources

Multimodal Information Fusion for Semantic Video Analysis

Multimedia data by its very nature contains multimodal information. For a successful analysis of multimedia content, all available multimodal information should be utilized. Additionally, since concepts can contain valuable cues about other concepts, concept interaction is a crucial source of multimedia information and helps to increase the fusion performance. The aim of this study is to ...

BUAA-iCC at ImageCLEF 2015 Scalable Concept Image Annotation Challenge

In this working note, we mainly focus on the image annotation subtask of the ImageCLEF 2015 challenge, in which the BUAA-iCC research group participated. For this task, we first explore textual similarity information between each test sample and each predefined concept. Subsequently, two different kinds of semantic information are extracted from visual images: visual tags using generic object recognition class...

LIRIS-Imagine at ImageCLEF 2012 Photo Annotation Task

In this paper, we present the methods we have proposed and evaluated through the ImageCLEF 2012 Photo Annotation task. More precisely, we have proposed the Histogram of Textual Concepts (HTC) textual feature to capture the relatedness of semantic concepts. In contrast to term frequency-based text representations mostly used for visual concept detection and annotation, HTC relies on the semantic...

BBN VISER TRECVID 2011 Multimedia Event Detection System

We describe the Raytheon BBN (BBN) VISER system that is designed to detect events of interest in multimedia data. We also present a comprehensive analysis of the different modules of that system in the context of the MED 2011 task. The VISER system incorporates a large set of low-level features that capture appearance, color, motion, audio, and audio-visual cooccurrence patterns in videos. For ...

IPAL Knowledge-based Medical Image Retrieval in ImageCLEFmed 2006

This paper presents the contribution of IPAL group on the CLEF 2006 medical retrieval task (i.e. ImageCLEFmed). The main idea of our group is to incorporate medical knowledge in the retrieval system within a multimodal fusion framework. For text, this knowledge is in the Unified Medical Language System (UMLS) sources. For images, this knowledge is in semantic features that are learned from exam...

Journal title:
  • Computer Vision and Image Understanding

Volume 117  Issue 

Pages  -

Publication date 2013